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Speaker: Nikolay Samokhvalov 


o Database systems: 


o 2002-2005: “SEE Server 200 


PostgreSQL 


o Worked on XML data type and functions (2005-2007) 
o Long-term community activist - #RuPostgres, Postgres.tv 


o since 2005: 


o Conferences Program Committee all highload” ec ze PGIBZE etc. 


o Current business: Si Postgres.al 


@ Created/reviewed more than 1,000 DB migrations 
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Sili Postgres.ai 


— clone DB of any size in a few seconds in bring them in any 
point of the DevOps lifecycle 

- automated (in Cl) testing of DB migrations 

- guess-free SQL optimization 

- Instantly deploy full-size staging apps 


Ag Gitlab chevycom Mmiro NUTAND<. 
Qan “CDK =B UNGRES 
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Fresh version of these slides 


bit.ly/highload2021 


- comments are open (and welcome!) 


W 


This talks goals 


© see some examples of mistakes, horror stories 


© learn something new 
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This talks goals 


© see some examples of mistakes, horror stories 


© learn something new 


how avoid downtime and issues — learn principles 


see concrete path to having downtime-free process 
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Terminology 


DML - database manipulation language 
(SELECT / INSERT / UPDATE / DELETE, etc.) 


DDL — data definition language 
(CREATE «ALTER «i, DROP...) 


DB migrations — planned, incremental changes 
of DB schema and/or data 
DB schema migration & data migration 


DB schema evolution, schema versioning 
DB change management, and so on 
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Applying a schema migration to 
a production database is always 
a risk 


Wikipedia 


https://en.wikipedia.org/wiki/Schema_migration 
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Types of mistakes 


Schema mismatch 
Heavy operation (processing too much data) 
Blocked (cannot acquire lock) 


Blocker (holding heavy lock) 


oe eS 


Post-deployment issues 
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DB change — risk classification 


Too much work later / for others 


Ch i 
Change fails or e 
is being blocked aher 


Too much work now / for us 
(to apply the change) 
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Ideal Change 


Too much work later / for others 


Ch i 
Change fails or eens 
is being blocked alier 


Too much work now / for us 
(to apply the change) 
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Schema mismatch 


Too much work later / for others 


Ch i 
Change fails or eens 
is being blocked others 


Too much work now / for us 
(to apply the change) 
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Heavy operation 


Too much work later / for others 


Ch i 
Change fails or Pon 
is being blocked others 


Too much work now / for us 
(to apply the change) 
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Blocked (cannot acquire lock) 


Too much work later / for others 


Ch i 
Change fails or e 
is being blocked ethers 


Too much work now / for us 
(to apply the change) 
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Blocker (holding heavy lock) 


Too much work later / for others 


Ch i 
Change fails or e 
is being blocked others 


Too much work now / for us 
(to apply the change) 
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Post-deployment issues 


Too much work later / for others 


Ch i 
Change fails or e 
is being blocked aher 


Too much work now / for us 
(to apply the change) 
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DB changes — risk classification 


Too much work later / for others 


. i Change is 
Change fails or E Downtime blocking 
is being blocked NE others 


Too much work now / for us 
(to apply the change) 
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Example #1 


create table t1 ( 
id int primary key, 
val text 


Ji 


-- dev, test, QA, staging, whatever - OK 


== prod: 
ERROR: relation "ti" already exists 
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Example #1 


Heavy 
work later 


create table t1 ( 
id int primary key, blocked = others 
val text Heavy 


work now 


Failed or Blocking 


Ji 


-- dev, test, QA, staging, whatever - OK 


-- prod: 
ERROR: relation "ti" already exists 
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IF [NOT] EXISTS 


create table if not exists t1 ( 
id int primary key, 
val text 
)5 
NOTICE: relation "ti" already exists, skipping 


CREATE TABLE 
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Start using DB schema migration tool 


7 Z  SSOITCH 


Flyway Liquibase 


* ` 

4, w n 
= a 
` 
a m 


RAILS 


django 99 uiiframework 


e 
n 


Test changes in Cl 


- Both DO and UNDO steps are supported (can revert) 
- Cl: test them all 
- Better: DO, UNDO, and DO again 
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Test changes in Cl 


- Both DO and UNDO steps are supported (can revert) 
- Cl: test them all 
- Better: DO, UNDO, and DO again 


Now guess what... 


“Thanks” to IF NOT EXISTS, we now may leave UNDO empty! 


s 
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MK Don't: 


- IF [NOT] EXIST 


Do: 


- test DO-UNDO-DO in Cl 
- keep schema up to date in all envs 
- dont ignore or work-around errors 
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The Landscape of the Database Testing (app dev) 


Static analysis 


Change 


Dynamic analysis 
(performance) 


GS» 
Schema = Data 
i 


Test schema Test data 


“micro” EXPLAIN (ANALYZE, BUFFERS) 


“macro” Load testing, benchmarks 
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Reliable database changes — the hierarchy of needs 


Actual, realistic testing Extremely few 


Review and approval process (manual) Some 


Many 


Version control for DB changes: Git & Flyway / Sqitch / Liquibase / smth else All 
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Actual, realistic testing Extremely few 


Review and approval process (manual) Some 


Many 


Version control for DB changes: Git & Flyway / Sqitch / Liquibase / smth else All 
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You 2021-05-16 11:29:58 


exec create table t1 as 
select id::int, random()::text as val Exal | iple #2 


from generate_series(1, 10000000) id; 


alter table t1 add primary key (id); 


fa Joe Bot 2021-05-16 11:29:59 
` 


exec create table t1 as select id::int, random()::text as val from generate series(1, 10000000) id; alter table ti 
add primary key (id); 


Session: webui-i4038 


% time seconds wait_event 

64.82 9.447511 Running 

7.92 1.154220 LWLock.WALWriteLock 
6.94 1.011216 10.DataFileExtend 
5.69 Q.829122 IO.WALWrite 

5.27 0.767460 IO.WALSync 

2255 @.370954 IO.DataFileWrite 
2.06 0.300581 IO.BufFileWrite 
2.04 @.297535 10.DataFileRead 
oil 0.220348 10.DataFileImmediateSync 
1.21 Q.176163 IO.BufFileRead 


100.00 14.575110 


The query has been executed. Duration: 14.575 s (estimated for prod: 13.518...116.725 s) 
Estimated timing for production (experimental). How it works 


Command 
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Example #2 — limited duration (15s) 


You 2021-05-16 11:43:16 
exec set statement timeout to '15s'; update t1 set val = replace(val, 'Q159', 'OiSg'): 


fz Joe Bot 2021-05-16 11:43:16 
w “ dë 


= 


exec set statement_timeout to '15s'; update t1 set val = replace(val, 'Q159', 'OiSg'); 
Session: webui-—i4038 


ERROR: ERROR: canceling statement due to statement timeout (SQLSTATE 57014) 


X Failed 
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Example #2 — limited duration (15s) 


You 2021-05-16 11:43:16 
exec set statement timeout to '15s'; update t1 set val = replace(val, '@159', 'OiSg'): 


a Joe Bot 2021-05-16 11:43:16 


exec set statement timeout to '15s'; update tl set val = replace(val, '0159', 'OiSg'); 
Session: webui-i4038 


ERROR: ERROR: canceling statement due to statement timeout (SQLSTATE 57014) 


X Failed TA 


work later 


Failed or Blocking 


blocked others 


Heavy 
work now 
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Example #2 — unlimited duration 


You 2021-05-16 12:00:11 
exec set statement timeout to 0: update t1 set val = replace(val, '@159', '0iSg'); 


f=) Joe Bot 2021-05-16 12:00:12 
{= 


exec set statement timeout to 0; update t1 set val = replace(val, '@159', 'OiSg'); 
Session: webui-i4038 


% time seconds wait_event 

70.34 31.070133 Running 

14.99 6.621164 LWLock.WALWriteLock 

4.46 1.972113 I0.WALInitWrite 

3.65 1.611055 10.DataFileExtend 

3.54 1.564610 IO.WALInitSync 

1.38 @.608596 IO.WALWrite 

1.33 @.588894 10.DataFileRead 

0.20 0.089901 LWLock.WALBufMappingLock 
0.10 0.044417 IO.WALSync 


100.00 44.170883 
The query has been executed. Duration: 44.171 s (estimated for prod: 42.615...43.106 s) 
Estimated timing for production (experimental). How it works 


@ Completed 
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Example #2 — unlimited duration 


You 2021-05-16 12:00:11 
exec set statement timeout to 0; update t1 set val = replace(val, '0159', 


fem Joe Bot 2021-05-16 12:00:12 


“i 


exec set statement timeout to 0; update t1 set val = replace(val, '@159', 'OiSg'): 


Session: webui-i4038 


31. 070133 


70.34 
14.99 
4.46 
3.65 
3.54 


100.00 44.170883 


6.621164 
1.972113 
1.611055 
1.564610 
0.608596 
0.588894 
0.089901 
0.044417 


vait event 


Running 
LWLock.WALWriteLock 
I0.WALInitWrite Failed or 
10.DataFileExtend blocked 
IO.WALInitSync 
IO.WALWrite 
IO.DataFileRead 
LWLock.WALBufMappingLock 
IO.WALSync 


The query has been executed. Duration: 44.171 s (estimated for prod: 42.615...43.106 s) 
Estimated timing for production (experimental). How it works 


@ Completed 


'0iSg'); 


Heavy 
work later 


Heavy 
work now 


— Blocking 


others 
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Example #2 — diagnostics: rows, buffers 


test=# explain (buffers, analyze) update t1 
set val = replace(val, 'Q159', 'OiSg'); 


QUERY PLAN 


Update on ti (cost=0.00..189165.00 rows=10Q00000 width=42) (actual time=76024.507..76024.5@8 rows=0 loops=1) 
Buffers: shared hit=60154265 read=91606 dirtied=183191 written=198198 
-> Seq Scan on ti (cost=0@.00..189165.00 rows=1000000OWidth=42) (actual time=@.367..2227.103 rows=10000000 


loops=1) 
Buffers: shared read=64165 written=37703 
Planning: sèn 5 
Buffers: shared hit=17 read=1 dirtied=1 hit: ~459 GiB 
Planning Time: @.497 ms read: ~/16 MiB 
Execution Time: 76024.546 ms ae E 
(E rawe) dirtied: ~1.4 GiB 


written: ~1.5 GiB 


Time: 76030.399 ms (@1:16.030) 


(with awful PG default settings) Si Postgres.ai 


Example #2 — UPDATEs vs. Bloat 


test=# create table al as select 1::int as i; 
SELECT 1 


test=# select ctid, * from a1; 


ctid |i 

-= = m m a a +--- 
(8,1) | 1 
(1 row) 


test=# update al set i = i; 
UPDATE 1 

test=# select ctid, * from a1; 
ctid |i 
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Example #2 — what to do 


Reduce the scope of work: 
- Split to batches 
- Temporary index to speed up lookups 


- Avoid useless, silly updates 


Avoid locking longer than 1s 
Control dead tuples / bloat 
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Example #3 — int4 PK problem 


test=# insert into ti select 2431, 7, 
ERROR: integer out of range 
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Example #3 — naive method 


test=# alter table ti alter column id type ints; 


ALTER TABLE 
Time: 273726.829 ms (@4:33.727) 
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Example #3 — ways to solve int4 PK problem 


Avoid: 
1a) Stop writing to the table 
1b) Use negative values — another space of 2431-1 values 


Transform without downtime: 


2a) “New column” method 
2b) “New table” method 
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Example #3 — The “New column” method 


- Create a int8 column 
- Install a trigger to copy value for all fresh rows 
Backfill the values for the existing rows 
- Redefine PK ———— a PK needs two things: 
- A unique index 
- NOT NULL constraint 
W both these are not trivial 


- Finally, all FKs referring to the old PK need to be redefined 
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Example #3 — The “New column” method 
How to create a unique index without downtime: 


create unique index concurrently on tbl(new_int8_column) ; 
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Example #3 — The “New column” method 

How to create a unique index without downtime: 

create unique index concurrently on tbl(new int8 column) ; 
- might fail — it's normal 
- if failed, leaves an INVALID index behind 


- cleanup & retry logic is needed 
(but not DROP IF EXISTS) 
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Example #3 — The “New column” method 


How to create a unique index without downtime: 


create unique index concurrently on tbl(new int8 column); 


- might fail — it's normal 

- if failed, leaves an INVALID index behind 

- cleanup & retry logic is needed 
(but not DROP IF EXISTS) 


Heavy 
work later 


Failed or Blocking 


blocked others 


Heavy 
work now 
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Example #3 — The “New column” method 


How to add NOT NULL without downtime? 


X Before Postgres 11 — impossible without downtime 
- NOT NULL constraint is not an “online” operation 
- CHECK (.. IS NOT NULL) is not “enough” for a PK 


Postgres 11+ trick: 
- alter table ... add column .. not null default -1; 
- Then "fix" all the -1 values 
- Finally, drop the DEFAULT 
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Example #3 — The “New table” method 


- CDC: a trigger + “delta” table to keep track of changes 
(or logical replication) 

- REPEATABLE READ and snapshot export to get the initial data 
- Take care of the constraints, indexes and all FKs 

- Redefining a FK is also not trivial: 

add NOT VALID (and VALIDATE after switching) 

- It’s even more tricky: FKs should be DISABLED till after switching 
- Switch from the old table to the new one 

- ina single transaction 

- catching up the CDC “tail” inside the transaction 
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Final example — chain of blockers 


session 1: 
begin; update t1 set id = id where id = 1; -- and sit waiting 
session 2: 
alter table ti add column one more int8; pee 
Failed or Blocking 
blocked others 
session 3: u 
eavy 
select * from t1 where id = 2; -- boom! a 
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Final example — chain of blockers 


change age | pid | wait event type | wait event | blocked_by_pids | state | lvl | blocking others | latest_query_in_tx 
------------ +-------+-----------------+------------+-----------------+--------+-----+-----------------+-------------------------------------------- 
00:06:41 | 28706 | Client | ClientRead | {} | idletx | 0 | 1 | update ti set id = id where id = 1; 
00:06:37 | 28709 | Lock | relation | (28706) | active | 1 | 1 | . alter table ti add column one more int8; 
00:06:28 | 28725 | Lock | relation | (28709) | active | 2 | o | . select * from t1 where id = 2; 

(3 rows) 


“Forest of lock trees” https://gitlab.com/-/snippets/1890428 
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Ideal ALTER: lock_timeout & retries — use pl/ogsal 


perform set config('lock timeout', lock timeout, false); -- 5@ms or so 


for i in 1..max_attempts loop 
begin 
execute ‘alter table ti add column ni ints8'; 
ddl completed := true; 
exit; 
exception when lock_not_available then Pea 
raise notice ‘ALTER attempts: #% failed’, i; 
end; ' 
end loop; Failedor  | Blocking 


How to run short ALTER TABLE VE work now 


without long locking concurrent queries 
https://www.depesz.com/2019/09/26/how-to-run-short-alter-table-without-long-locking-concurrent-queries/ 


(see the comment by Mikhail Velikikh) 
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How to become a “pro” 


1. Test everything 


How to become a “pro” 


1. Test everything 


2. Make testing convenient 


Database Migration Testing with Database Lab 


Hierarchy of Database Testing 


- Realistic migration testing is hard 


Automated Realistic Testing 


Manual Migration Testing on Full DB 


- No testing = unexpected problems 


Automated Migration Testing on a “Toy” DB 


Version control for DB changes: Git & Flyway / Sqitch / Liquibase / etc. 
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Database Migration Testing with Database Lab 


- Realistic migration testing is hard e 


Automated Realistic Testing 


Manual Migration Testing on Full DB 


- No testing = unexpected problems 


Automated Migration Testing on a “Toy” DB 


Version control for DB changes: Git & Flyway / Sqitch / Liquibase / etc. 


(a 


A Database Lab makes realistic testing easy 


Pipeline Needs Jobs 2 Tests 0 


Clone_request Db_migrate 


©) clone request (43 ©) db_migrate 3 
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Thank you! 


Slack (EN): slack.postgres.ai 
Telegram (RU): t.me/databaselabru 


Join the Database Lab Customer Advisory Group: 
https://postgres.ai/customer-advisory-group 
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AU BE 
LON TIN HF p aba ayi 


Some examples of failures due to lack of testing 


Incompatible changes — production has different DB schema than dev 8 test 
Cannot deploy — hitting statement_timeout — too heavy operations 


During deployment, we've got a failover 
Deployment lasted 10 minutes, the app was very slow (or even down) 


Two weeks after deployment, we realize that the high bloat growth 
we have now has been introduced by that deployment 
Deployment succeeded, but then we have started to see errors 
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We need better tools 
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SCIENTIFIC 


ONE DOLLAR 


March 1973 


BICYCLE TECHNOLOGY 


ST OF TRANSPORT (CALORIES PER GRAM PER KILOMETER) 


BLOWF 
. . 
FRUIT FLY 8B 


o 
MICE e 
tea 
@ LEMMING 
LY 
Kë a 
EE RAT 
LOCUST 
e o 
. RABBIT 
HUMMINGBIRD 
o 
BUDGERIGAR 
. 
GULL 
a 
PIGEON 
e 
SALMON 
10= 10-3 10°? 10 + 
BODY WEIGHT (KILC 


. 
HELICOPTER 


JET FIGHTER 
o 
OLIGHT PLANE 
cow 
Kë e 
PS AUTOMOBILES 
e 
JET TRANSPORT 


OMAN ON BICYCLE 


107 103 104 105 
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Steve Jobs (1980) 


1) We, humans, are great tool-makers. 
We amplify human abilities. 


2) Something special happens 
when you have 1 computer and 1 person. 


It's very different that having 1 computer and 10 persons. 
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Traditional DB experiments — thick clones 


I \ 

I l 

I I 

I l 

> I I 

I I 

I I 

I I 

aie I l 

, \ / 

Production sins: sons ‘ates sali sams amy daa ne a I 
G 

a © o le 

Se @® 


“1 database copy - 10 persons” 


Se se se ea 


( 


a se ne se ne ne me aa 
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Database Lab: use thin clones 


Production 


“1 database copy — 1 person’ 
SH Postgres.ai 


“Thin clones’ — Copy-on-Write (CoVV) 


4 Thick copy of production (any size] 


© Thin clone [size starts from 1 MB, depends on changes) 
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Database Lab — Open-core model 


42489 
uuuy 
uuuy 


The Database Lab Engine (DLE) The Platform (SaaS) 
Open-source (AGPLv3) Proprietary (freemium) 
- Thin cloning — API & CLI - Web console — GUI 

- Automated provisioning and data refresh - Access control, audit 
- Data transformation, anonymization - History, visualization 


- Supports managed Postgres (AWS RDS, etc.)  - Support 


https://gitlab.com/postqres-ai/database-lab https://postares.ai/ 
M use these links to start using it for your databases ^^ 
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“ 


Database Lab unlocks ‘Shift-left testing” 


Development bottlenecks Frictionless development 
(with standard staging DB) (with Database Lab) 


x Bugs: difficult to reproduce, easy to miss © Bugs: easy to reproduce, and fix early 

X Not 100% of changes are well-verified © 100% of changes are well-verified 

x SQL optimization is hard 9 SQL optimization can be done by anyone 
X Each non-prod big DB costs a lot © Non-prod DB refresh takes seconds 

X Non-prod DB refresh takes hours, days, weeks 9 Extra non-prod DBs doesn't cost a penny 
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Database experiments on thin clones — yes and no 


Yes No 


- Check execution plan — Joe bot - Load testing 
- EXPLAIN w/o execution - Regular HA/DR goals 
- EXPLAIN (ANALYZE, BUFFERS) - backups 
- (timing is different; structure and buffer - (but useful to check 


numbers — the same 
) WAL stream, recover 


- Check on we records by mistake) 
-  Inaex Id€as (JOE DO 
- hot standby 


- auto-check DB migrations (CI Observer) 
F | w en (but useful to offload very 
- Heavy, long queries: analytics, dump/restore long-running SELECTS) 
- No penalties! 
(think hot_standby_feedback, locks, CPU) 


e 
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DB migration testing — ‘stateful tests in Cl” 


What we want from testing of DB changes: 
- Ensure the change is valid 
- |t will be executed in appropriate time 


- |t wont put the system down 
... and; 


- What to expect? (New objects, size change, duration, etc.) 
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Perfect Lab for database experiments 


- Realistic conditions — as similar to production as possible 


The same schema, data, environment as on production 


Very similar background workload 


- Full automation 
- “Memory” (store, share details) 
- Low iteration overhead (time & money) 


- Everyone can test independently 


allowed to fail — allowed to learn 
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Database experiments with Database Lab today (2021) 


- Realistic conditions — as similar to production as possible 
The same schema, data, environment as on production 
te errr 
- Fine automation 
- “Memory” (store, share details) 
- Low iteration overhead (time & money) 


- Everyone can test independently 


able to fail — able to learn 


SH Postgres.ai 


Why Database Lab was created 


- Containers, OverlayFS (file-level CoW) 
Cl:docker pull ... && docker run ... 


— OK only for tiny (< a few GiB) databases 


- Existing solutions: Oracle Snap Clones, Delphix, Actifio, etc. 
SSSS, not open 


— OK only for very large enterprises 
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Companies that do need it today 


- 10+ engineers 

- Multiple backend teams (or plans to split soon) 
- Microservices (or plans to move to them) 

- 100+ GiB databases 


- Frequent releases 
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Database Lab — a high-level overview (with SaaS) 


Production is not affected Unobtrusive 
and can use any filesystem statistics collection 


Production Postgres-checkup 
databases agent 


Uses ZFS or LVM 


Backups Database Lab 
Engine instances 


— 
= Tunnel 
= 


Joe 
instance 


Data does not leave your infrastructure Your infrastructure 


— Data flow 
--> Metadata flow [clone management, query plans, etc.) 


Abstract queries, 
metadata 


Clone management 


Queries, plans, 
metadata 


Database Lab 
Platform 
(SaaS] 


Tunnel 


server 


Postgres.ai 
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Inside the Database Lab Engine 2.x 


The main & 


container Clone. 


“sync” 'dblab Port: 6001 mi 
container GEIS ENC 6001 Test of a 


DB migration 


= control API shared_buffers: 1Gi 600x 


shared_buffers: 8Gi shared_buffers: 1Gi 


Test of a 


shared_buffers: 1Gi DB migration 
vw 


Test of a 


Shared cache (OpenZFS: ARC): 50% of RAM atl a 


1 (or N) physical disk(s) + CoW support 
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DLE — the data flow (physical mode) 


Independent PGDATA ready, 
Ready to accept any DDL, DML 


Clone 
destroyed 


The “sync” instance 


Independent PGDATA ready, 


applies WALs Ready to accept any DDL, DML 


continuously 


snapshot A 


snapshot B 
snapshot C 


The main PGDATA version 
— the lag behind production 
is usually a few seconds 


“Golden copy” 
(initial thick clone) 


A new snapshot is created every N hours Si) Postgres.ai 


How snapshots are created (ZFS version) 


- Create a “pre” ZFS snapshot (R/O) 
- Create a “pre” ZFS clone (R/W) 
- DLE launches a temporary “promote” container 
- If needed, performs “preprocessing” steps (bash) 
- Uses pre' clone to run Postgres and promote it to primary state 
- If needed, performs “preprocessing” SQL queries 
- Performs a clean shutdown of Postgres 


- Create a final ZFS snapshot that will be used for cloning 
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Major topics of automated (Cl) testing on thin clones 


= Security 
https://postares.ai/docs/platform/security 


- Capturing dangerous locks 


Cl Observer: https://postgres.ai/docs/database-lab/cli-reference#subcommand-start-observation 


- Forecast production timing 


Timing estimator: https://postares.ai/docs/database-lab/timing-estimator 
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Making the process secure: where to place the DLE? 


Pll h Cre 


mms he big wall 


Dev & Test 


re 
ci rune” l 


PA 
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Where to place the DLE? Current approach 


Pil h Cre 


mms he big wall 


High, i 
gh levej API. , 
p a 
Is GitHub 


Dev & Test | a misi 


re l E 
ca runne! vë 


Jenkins 
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How it looks like: Cl part 


Example: GitHub Actions: 


©) Search or jump to... Pullrequests Issues Marketplace Explore 


Code Issues Pull requests 1 © Actions Projects 


€ bad migration .github/workflows/main.yml #97 


Q Summar. E 
y CI migration 
led 2 da 
Jobs 
© CI migration > Set up job 


Checkout 


Run migrations 


Q Upload artifacts 


/) Get the response status 
Post Checkout 


Complete job 


Wiki 


Security 


Insights 
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More about dangerous lock detection 


Postgres.ai Console B Nikolay @) 


Organization Switch 


Demo 


og 


EJ 


Q 


Dashboard 


Database Lab 


Instances 


Observed sessions 


SQL Optimization 


Ask Joe 


History 


Checkup 


Reports 


Settings 
General 
Members 
Access tokens 
Billing 


Audit 


Documentation 


Ask support 


Organizations / Demo / Observed sessions / Database Lab observed session #166 


Database Lab observed session #166 


Summary 


Status: X Failed 


Session: #166 
Project: - 


DLE instance: 


Duration: 2m, 5s 
Created: 2 days ago 
Branch: master 


Commit: - 
Triggered by: - 
PR/MR: - 


Checklist 


X Failed Dangerous locks are not observed during the session 
(125 intervals with locks of 1 allowed) 


Session duration is within allowed interval 
(spent 2m, 5s of the allowed 5m) 


Observed intervals and details 


Hide intervals ^ 


Started at 
v 2021-02-26 16:18:16 UTC 
v 2021-02-26 16:18:17 UTC 
A 2021-02-26 16:18:18 UTC 


Duration 
1s 
1s 
1s 


{"datname":"test","relation":"pgbench_branches",""transactionid": null, "mode" :"AccessExclusiveLock","locktype 


"ananvi. 


dran tahle nabanch hranchac:" "query ctart"."JA71-A2-JAT1A:1R:1R AZ1QZQLAN- AA" "atata" 


«"idla in 


relation 


"grante! 


= 


d" : true, "usename" :"dblab_user_1" 


SH Postgres.ai 


“yé Overview 26 Commits 5 Pipelines 7 Changes 7 6 unresolved threads [P @ ^ 


a E Dmytro Zaporozhets (DZ) @dzaporozhets - 1 week ago Owner) QD 3 
ba @abrandl as per 154466 (comment 511910471) can you please review this merge request? 

B 

D Pik gitlab-org/database-team/gitlab-com-database-testing @project_278964_bot2 - 1 week ago Maintainer © DO : 
E A 
ka 7 . . 

| æ= Database migrations 
n 
Migrations included in this change have been executed on gitlab.com data for testing purposes. For details, please see the migration testing pipeline (limited access). Note that 

= this includes pending migrations from master . 

a Migration Total runtime Result DB size change 

a 20210215144909 1.218 +0.00 B 

L) 

a 20210218105431 0.6 s 3 40.00 B 

shi Migration: 20210215144909 

X e Duration: 1.2 s 

SË e Database size change: +0.00 B 

Ba 


Migration: 20210218105431 


e Duration: 0.6 s 
e Database size change: +0.00 B 


Query Calls Total Time Max Time Mean Time Rows 


ALTER TABLE "ci_builds" DROP COLUMN "artifacts file" /*xapplication:test*/ 1 12.9 ms 12.9 ms 12.9 ms 0 


Artifacts 


e Database testing statistics 
e Database Lab Instance 


February 19, 2021 — https.//gitlab.com/gitlab-org/gitlab/-/merge requests/54564#note 512678910 Siii Postgres.ai 


Example: GitLab.com, testing database changes using Database Lab 
- Full automation 
- GitLab CI/CD pipelines securely work with Database Lab 


Database Lab clones ~10 TiB database in ~10 seconds 


Read their blueprint: 


https://docs.gitlab.com/ee/architecture/blueprints/database_testing/ 


SH Postgres.ai 


More about production timing estimation 


Experimental, WIP: https://postgres.ai/docs/database-lab/timing-estimator 


estimator: 
readRatio: 1 
writeRatio: 1 
profilingInterval: 2@ms 
sampleThreshold: 100 


Profiling process 63 with 1@ms sampling 
e s d i 


«715111 10.DataFi id 
7.893916 Running 
1.097738 10.DataFileExtend 
0.787341 LWLock.WALWriteLock 
0.696663 10.BufrileRead 
0.662457 IO.BufFilewrite 
0.654081 I0.WALInitWrite 
0.499461 IO.WALInitSync 
0.335660 IO.WALWrite 
0.301637 10.DataFileImmediateSync 
0.250249 10.WALSync 
0.020805 LWLock.WALBufMappingLock 


100.00 30.915119 


Summary: 


Time: 3.148 s 
- planning: 0.168 ms 
- execution: 3.147 s Cestimated* for prod: 2.465...2.693 s) 
- I/0 read: 627.267 ms 
- I/0 write: 3.644 ms 


Shared buffers: 
- hits: 1016393 (~7.8@ GiB) from the buffer pool 
- reads: 16395 (~128.10 MiB) from the OS file cache, including disk I/0 
- dirtied: 16395 (~128.1@ MiB) 
- writes: 280 (~2.2@ MiB) 


SH Postgres.ai 


Summary — available in PR/MR and visible to whole team 


- When, who, status 

- Duration (in the Lab + estimated for production) 
- Size changes, new objects 

- Dangerous locks 

- Error stats 

- Transaction stats 

- Query analysis summary 

- Tuple stats 

- WAL generated, checkpoitner/bgwriter stats 

- Temp files stats 


Example (WIP): https://aitlab.com/postares-ai/database-lab/-/snippets/2083427 


SH Postgres.ai 


More artifacts, details — restricted access 


- System monitoring (resources utilization) 
- pg_stat_* 

- pg_stat_statements, pg_stat_kcache 

- logerrors 

- Postgres log 

- pgBadger (html, json) 

- wait event sampling 

- perf tracing, flamegraphs; or eBPF 

- Estimated production timing 


https://gitlab.com/postaqres-ai/database-lab/-/issues/226 


SH Postgres.ai 


Database Lab Roadmap 
https://postares.ai/docs/roadmap 


- Lower the entry bar 
- Simplify installation 
- Simplify the use 


- Easy to integrate 


- KKK KKKK k KKKKKKK 


SH Postgres.ai 


Where to start 


Postgres.ai/docs/ 


